133 research outputs found
Non-linear Attributed Graph Clustering by Symmetric NMF with PU Learning
We consider the clustering problem of attributed graphs. Our challenge is how
we can design an effective and efficient clustering method that precisely
captures the hidden relationship between the topology and the attributes in
real-world graphs. We propose Non-linear Attributed Graph Clustering by
Symmetric Non-negative Matrix Factorization with Positive Unlabeled Learning.
Our method has three key features: 1) it learns a non-linear
projection function between the different cluster assignments of the topology
and the attributes of graphs so as to capture the complicated relationship
between the topology and the attributes in real-world graphs, 2) it leverages
positive-unlabeled learning to incorporate the effect of partially observed
positive edges into the cluster assignment, and 3) it achieves efficient
computational complexity in terms of the vertex size, the attribute size,
the number of clusters, and the number of iterations for learning the
cluster assignment. We conducted extensive experiments against various
clustering methods on various real datasets and validated that our method
outperforms existing clustering methods in clustering quality.
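To make the "Symmetric NMF" building block concrete, here is a minimal sketch of plain symmetric non-negative matrix factorization on an adjacency matrix, which approximates A by H Hᵀ with H non-negative and reads cluster assignments off the rows of H. It uses the well-known damped multiplicative update with beta = 0.5; the function names, the random initialization, and the iteration count are illustrative assumptions. It does not include the non-linear projection, the attribute side, or the positive-unlabeled learning described in the abstract, so it should not be read as the proposed method.

```python
import numpy as np

def symnmf(A, k, iterations=200, beta=0.5, seed=0):
    """Approximate a symmetric non-negative matrix A by H @ H.T with H >= 0.

    Plain symmetric NMF only (an assumption-laden sketch, not the paper's method).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))  # non-negative random initialization
    eps = 1e-12                      # guards against division by zero
    for _ in range(iterations):
        numerator = A @ H
        denominator = H @ (H.T @ H) + eps
        # Damped multiplicative update; keeps every entry of H non-negative.
        H *= (1.0 - beta) + beta * numerator / denominator
    return H

def reconstruction_error(A, H):
    """Squared Frobenius norm of A - H @ H.T."""
    return float(np.linalg.norm(A - H @ H.T) ** 2)
```

Each vertex can then be assigned to the cluster with the largest entry in its row, e.g. `symnmf(A, k).argmax(axis=1)`.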
Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spaces
In many fields, e.g., data mining and machine learning, distance-based outlier detection (DOD) is widely employed to remove noise and find abnormal phenomena, because DOD is unsupervised, can be employed in any metric space, and makes no assumptions about data distributions. Nowadays, data mining and machine learning applications face the challenge of dealing with large datasets, which requires efficient DOD algorithms. We address the DOD problem under two different definitions. Our new idea for solving these problems is to exploit an in-memory proximity graph. For each problem, we propose a new algorithm that exploits a proximity graph and analyze which type of proximity graph is appropriate for the algorithm. Our empirical study using real datasets confirms that our DOD algorithms are significantly faster than state-of-the-art ones.
Amagata D., Onizuka M., Hara T. Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spaces. VLDB Journal 31, 797 (2022); https://doi.org/10.1007/s00778-022-00729-1
Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach
Distance-based outlier detection is widely adopted in many fields, e.g., data mining and machine learning, because it is unsupervised, can be employed in a generic metric space, and makes no assumptions about data distributions. Data mining and machine learning applications face the challenge of dealing with large datasets, which requires efficient distance-based outlier detection algorithms. Thanks to the spread of computational environments with large memory, it is now possible to build a main-memory index and detect outliers based on it, which is a promising solution for fast distance-based outlier detection. Motivated by this observation, we propose a novel approach that exploits a proximity graph. Our approach can employ an arbitrary proximity graph and obtains a significant speed-up over state-of-the-art algorithms. However, designing an effective proximity graph raises a challenge, because existing proximity graphs do not consider efficient traversal for distance-based outlier detection. To overcome this challenge, we propose a novel proximity graph, MRPG. Our empirical study using real datasets demonstrates that MRPG detects outliers significantly faster than the state-of-the-art algorithms.
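As a point of reference, the brute-force baseline that faster algorithms aim to beat can be sketched as follows. The sketch assumes the classic (r, k) definition, under which an object is an outlier if fewer than k other objects lie within distance r of it; the abstract does not spell out its definition, and the function name, the use of Euclidean distance, and the parameters `r` and `k` are illustrative assumptions. This is not the proximity-graph approach or MRPG.

```python
import numpy as np

def distance_based_outliers(points, r, k):
    """Return indices of points with fewer than k other points within distance r.

    Brute-force O(n^2) baseline with Euclidean distance (an assumption);
    any metric could be substituted for the distance computation.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within r, excluding each point itself.
    neighbor_counts = (dists <= r).sum(axis=1) - 1
    return np.flatnonzero(neighbor_counts < k)
```

The quadratic number of distance computations in this baseline is what makes an index structure attractive on large datasets.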
Scaling Manifold Ranking Based Image Retrieval
Manifold Ranking is a graph-based ranking algorithm that has been successfully applied to retrieve images from multimedia databases. Given a query image, Manifold Ranking computes the ranking scores of images in the database by exploiting the relationships among them expressed in the form of a graph. Since Manifold Ranking effectively utilizes the global structure of the graph, it is significantly better at finding intuitive results compared with current approaches. Fundamentally, Manifold Ranking requires an inverse matrix to compute ranking scores and so needs O(n^3) time, where n is the number of images. Manifold Ranking, unfortunately, does not scale to support databases with large numbers of images. Our solution, Mogul, is based on two ideas: (1) it efficiently computes ranking scores using sparse matrices, and (2) it skips unnecessary score computations by estimating upper-bounding scores. These two ideas reduce the time complexity of Mogul to O(n) from the O(n^3) of the inverse matrix approach. Experiments show that Mogul is much faster and gives significantly better retrieval quality than a state-of-the-art approximation approach.
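The inverse-matrix computation that the abstract identifies as the bottleneck can be sketched as below. This assumes the standard Manifold Ranking formulation, in which scores are f = (I - alpha S)^(-1) y with S the symmetrically normalized affinity matrix and y an indicator vector for the query; the damping parameter `alpha`, the normalization, and the function name are assumptions not stated in the abstract. It is the dense O(n^3) baseline, not Mogul.

```python
import numpy as np

def manifold_ranking(W, query_index, alpha=0.99):
    """Dense Manifold Ranking scores for one query.

    W is a symmetric non-negative affinity matrix with no isolated nodes
    (every row sum must be positive). Solving the dense linear system
    costs O(n^3), which is the scalability problem described above.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    S = W * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    y = np.zeros(n)
    y[query_index] = 1.0  # indicator vector for the query image
    # Solve (I - alpha * S) f = y instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - alpha * S, y)
```

Images are then ranked by sorting the returned scores in descending order, e.g. `np.argsort(-scores)`.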
Beyond Real-world Benchmark Datasets: An Empirical Study of Node Classification with GNNs
Graph Neural Networks (GNNs) have achieved great success on node
classification tasks. Despite the broad interest in developing and evaluating
GNNs, they have been assessed with limited benchmark datasets. As a result, the
existing evaluation of GNNs lacks fine-grained analysis from various
characteristics of graphs. Motivated by this, we conduct extensive experiments
with a synthetic graph generator that can generate graphs having controlled
characteristics for fine-grained analysis. Our empirical studies clarify the
strengths and weaknesses of GNNs from four major characteristics of real-world
graphs with class labels of nodes, i.e., 1) class size distributions (balanced
vs. imbalanced), 2) edge connection proportions between classes (homophilic vs.
heterophilic), 3) attribute values (biased vs. random), and 4) graph sizes
(small vs. large). In addition, to foster future research on GNNs, we publicly
release our codebase that allows users to evaluate various GNNs with various
graphs. We hope this work offers interesting insights for future research.
Comment: Accepted to NeurIPS 2022 Datasets and Benchmarks Track. 21 pages, 15
figures
Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators
In recent years, machine learning-based cardinality estimation methods have been
replacing traditional methods. This change is expected to contribute to one of
the most important applications of cardinality estimation, the query optimizer,
to speed up query processing. However, none of the existing methods
precisely estimates cardinalities when relational schemas consist of many tables
with strong correlations between tables/attributes. This paper shows that
multiple density estimators can be combined to effectively target the
cardinality estimation of data with large and complex schemas having strong
correlations. We propose Scardina, a new join cardinality estimation method
using multiple partitioned models based on the schema structure.
- …